"Advanced Graphics and Data Visualization in R" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.
This lesson is the first in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.
The structure of the class is a code-along style in Jupyter notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto Jupyter Hub so students can program along with the instructor.
This week will be your crash-course on Jupyter notebooks and R to refresh on packages and principles that will be relevant throughout our course. In our lectures and your assignments we will be working with some uncurated data to simulate the full experience of working with data from start to finish. It's important that we are all familiar with, and understand, the majority of the tidy data methods that we'll be using in class so that we can focus on the new material as it appears. We'll use some standard packages and practices to finesse our data before visualizing it, so let's R-efresh ourselves.
At the end of this lecture we will have covered the following topics:
grey background (e.g. tidyverse) - a package, function, code, command, or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
Today's datasets will focus on epidemiological data from the Ontario provincial government found here.
This dataset was obtained from the Ontario provincial website and holds statistics regarding SARS-CoV-2 cases throughout different public health units in the province. It is in a comma-separated format and has been collected since 2020-03-24.
This dataset was obtained from the Ontario provincial website and holds statistics regarding SARS-CoV-2 throughout the province. It is in a comma separated format and has been growing/expanding since initial tracking started on 2020-01-26.
repr- a package useful for altering some of the attributes of objects related to the R kernel.
tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2
viridis helps to create colour-blind-friendly palettes for our data visualizations
lubridate and zoo are helper packages used for working with date formats in R
Let's run our first code cell!
# Packages to help tidy our data
library(tidyverse)
# Packages for the graphical analysis section
library(repr)
library(viridis)
# Packages used for working with/formatting dates in R
library(lubridate)
library(zoo)
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.3.2     v purrr   0.3.4
v tibble  3.0.4     v dplyr   1.0.2
v tidyr   1.1.2     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Loading required package: viridisLite
Attaching package: 'lubridate'
The following objects are masked from 'package:base': date, intersect, setdiff, union
Attaching package: 'zoo'
The following objects are masked from 'package:base': as.Date, as.Date.numeric
Your work with Jupyter Notebooks on the University of Toronto JupyterHub will all be contained within a new browser tab, with the address bar showing something like
https://jupyter.utoronto.ca/user/assigned-username-hexadecimal/tree/2021.03_Adv_Graphics_R
All of this is running remotely on a University of Toronto server rather than your own machine.
You'll see a directory structure from your home folder:
i.e. \2021.03-Adv_Graphics_R\, with a folder named Lecture_01_R_Introduction within. Clicking on that, you'll find Lecture_01.R-efresher.skeleton.ipynb, which is the notebook we will use for today's little code-along.
We've implemented the class this way to reduce the burden of having to install various programs. While installation can be a little tricky, it's really not that bad. For this introduction course, however, you don't need to go through all of that just to learn the basics of coding.
Jupyter Notebooks also give us the option of inserting "markdown" text, much like what you're reading at this very moment. So we can intersperse ideas and information between our code blocks.
There is, however, an appendix section at the end of this lecture detailing how to install Jupyter Notebooks (and the R-kernel for them), as well as independent installation of R itself and a great integrated development environment (IDE) called RStudio.
So... what is in these packages? A package can be a collection of functions, datasets, and documentation.
Functions are the basic workhorses of R; they are the tools we use to analyze our data. Each function can be thought of as a unit that has a specific task. A function takes an input, evaluates it using an expression (e.g. a calculation, plot, merge, etc.), and returns an output (a single value, multiple values, a graphic, etc.).
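To make the input-expression-output pattern concrete, here is a small sketch of a hypothetical function (the name and conversion are illustrative, not from the course materials):

```r
# A hypothetical function: takes an input, evaluates an expression, returns an output
celsius_to_fahrenheit <- function(temp_c) {
  temp_f <- temp_c * 9 / 5 + 32  # the expression evaluated on the input
  return(temp_f)                 # the output handed back to the caller
}

celsius_to_fahrenheit(100)       # a single value in, a single value out
celsius_to_fahrenheit(c(0, 37))  # many R functions are vectorized: a vector in, a vector out
```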
In this course we will rely a lot on a package called tidyverse which is also dependent upon a series of other packages.
Behind the scenes of each Jupyter notebook a programming kernel is running. For instance, depending on setup our notebooks can run a true or "emulated" R-kernel to interpret each code cell as if it were written specifically for the R language.
As we move from code cell to new code cell, all of the variables or objects we have created are stored within memory. We can refer to these as we run the code and move forward, but if you overwrite or change them by mistake, you may have to rerun multiple cell blocks!
There are some options in the "Cell" menu that can alleviate these problems such as "Run All Above". If you think you've made a big error by overwriting a key object, you can use that option to "re-initialize" all of your previous code!
The run order of your code is also visible at the side of each code cell as [x]. When a code cell is still actively running it will be denoted as [*] since a number cannot be assigned to it. You'll also notice your kernel (top right of the menu bar) has a small circle that will be dark while running, and clear while idle.
Remember these friendly keys/shortcuts:
Esc to enter "Command Mode", which basically takes you outside of the cell
Enter to edit a cell
Arrow keys to navigate up and down (and within a cell)
Ctrl+Enter to run a cell (both code and markdown)
Shift+Enter to run the current cell and move to the next one below
Ctrl+/ to quickly comment and uncomment single or multiple lines of code

In Command Mode:

a insert a new cell above the currently selected cell
b insert a new cell below the currently selected cell
** Note that new cells default to code cells
m converts a cell to a markdown cell
y converts a cell to a code cell
r converts a cell to a raw nbconvert cell. This is most helpful when wishing to preserve a code format without running it through the kernel.

Depending on your needs, you may find yourself doing the following:
Jupyter allows you to alternate between "markdown" notes and "code" that can be run or re-run on the fly.
Each data run and its results can be saved individually as a new notebook, or as new cells, to compare data and small changes to analyses!
Let's discuss some important behaviours before we begin coding:
Why bother?
Your worst collaborator is potentially you in 6 days or 6 months. Do you remember what you had for breakfast last Tuesday?
You can annotate your code for selfish reasons, or altruistic reasons, but annotate your code.
How do I start?
It is, in general, part of best coding practices to keep things tidy and organized.
A hash-tag # will comment your text. Inside a code cell in a Jupyter Notebook or anywhere in an R script, all text after a hashtag will be ignored by R and by many other programming languages. It's very useful to add comments about changes in your code, as well as detailed explanations about your scripts.
Put a description of what you are doing near your code at every process, decision point, or non-default argument in a function. For example, why you selected k=6 for an analysis, or the Spearman over Pearson option for your correlation matrix, or quantile over median normalization, or why you made the decision to filter out certain samples.
Break your code into sections to make it readable. Scripts are just a series of steps and major steps should be titled/outlined with your reasoning - much like when presenting your research.
Give your objects informative object names that are not the same as function names.
Comments may/should appear in three places:
# At the beginning of the script, describing the purpose of your script and what you are trying to solve
bedmasAnswer <- 5 + 4 * 6 - 0 #In line: Describing a part of your code that is not obvious what it is for.
Maintaining well-documented code is also good for mental health!
Basically, you have the following options:
The most important aspects of naming conventions are being concise and consistent! Throughout this course you'll see a hybrid system that uses the underscore to separate words but a period right before denoting the object type, i.e. this_data.object.
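A quick sketch of the common conventions side by side (the object names here are made up for illustration):

```r
# Common naming conventions for the same hypothetical object
case_counts  <- c(5, 12, 9)   # snake_case
caseCounts   <- c(5, 12, 9)   # camelCase
# The hybrid used in this course: underscores between words,
# a period right before the object type
case_counts.vector <- c(5, 12, 9)
case_counts.df     <- data.frame(day = 1:3, cases = case_counts)
# Avoid names that shadow functions, e.g. don't name an object `mean` or `df`
```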
Use version control.
For more information on best coding practices, please visit swcarpentry
We all run into problems. We'll see a lot of mistakes happen in class too! That's OK if we can learn from our errors and quickly (or eventually) recover.
Use getwd() to check where you are working, type list.files() or use the Files pane to check that your file exists there, and setwd() to change your directory if necessary. Preferably, work inside an R project with all project-related files in that same folder. Your working directory will be set automatically when you open the project (this can be done by using File -> New Project and following prompts).
Use typeof() and class() to check what type of data you have. Use str() to peek at your data structures if you're making assumptions about them.
Check documentation with help("function"), ?function (using the name of the function that you want to check), or help(package = "package_name").
Load packages with library("package_name"). If you only need one function from a package, or need to specify to what package a function belongs because there are functions with the same name that belong to different packages, you can use a double colon, i.e. package_name::function_name.
A session aborted can happen for a variety of reasons, like not having enough computational power to perform a task or because of a system-wide failure. You will need to rerun your previous cells!
When asking for help, including the program, version, error, package, and function helps; be specific. Sometimes it is useful to include your operating system and version (Windows 10, Ubuntu 18, Mac OS 10, etc.).
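A few of the troubleshooting helpers above in action (the example vector is a throwaway, and the `stats::filter()` call simply shows the double-colon disambiguation against dplyr's `filter()`):

```r
# Where am I working, and what files live here?
getwd()
list.files()

# What kind of data do I have?
x <- c(1.5, 2.5)
typeof(x)  # the storage type: "double"
class(x)   # the object class: "numeric"
str(x)     # a compact peek at the structure

# Disambiguate functions that share a name across packages
stats::filter(1:10, rep(1, 3))  # base stats' filter, not dplyr::filter
```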
You may run into assignment questions where the tools I've provided in lecture are not enough to reproduce the example output exactly as provided. If you wish to go that extra mile you may need to look for answers elsewhere by consulting references from the class or searching for it yourself.
Remember: Everyone looks for help online ALL THE TIME. It is very common. Also, with programming there are multiple ways to come up with an answer, even different packages that let you do the same thing in different ways. You will work on refining these aspects of your code as you go along in this course and in your coding career.
Last but not least, to make life easier: Under the Help pane, there is a cheatsheet of Jupyter notebook keyboard shortcuts or a browser list here.
There are many tips and tricks to remember about R but here we'll quickly recall some foundation knowledge that could be relevant in later lectures.
If we want to hold onto a number, calculation, or object we need to assign it to a named variable. R has multiple methods for assigning a value to a variable and an order of precedence!
-> and ->> Rightward assignment: we won't really be using this in our course.
<- and <<- Leftward assignment: assignment used by most 'authentic' R programmers but really just a historical throwback.
= Leftward assignment: commonly used token for assignment in many other programming languages but carries dual meaning!
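The three assignment styles side by side, including the dual meaning of `=` (the variable names are arbitrary):

```r
# All three of these store 8 in a variable
a <- 8   # leftward assignment, the most common style in R code
8 -> b   # rightward assignment, rarely seen
d = 8    # `=` also assigns at the top level...

# ...but inside a function call, `=` matches an argument name instead
mean(x = c(1, 2, 3))  # here x names mean()'s argument; no object x is created
```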
What do I mean by 'types' of data?
character: "a" or "aa" or "@c#o0*"
double: 7.5
integer: 1
logical: TRUE or FALSE

The job of data structures is to "host" the different data types. There are five types of data structures in R:
Also known as atomic vectors, each element within a vector must be of the same data type: logical, integer, double, character, complex, or raw.
For each vector there are two key properties that can be queried with typeof() and length().
There is a numerical order to a vector, much like a queue, AND you can access each element (piece of data) individually or in groups. Elements are ordered from 1 to length(your_vector) and can be accessed with [].
Elements of a vector may be named, to facilitate subsetting by character vectors.
Elements of a vector may be subset by a logical vector.
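Before the code-along cell below, here is a small sketch querying a vector's two key properties (the vector itself is a made-up example):

```r
# A named character vector and its two key properties
provinces <- c(ON = "Ontario", QC = "Quebec", BC = "British Columbia")
typeof(provinces)  # the data type of every element: "character"
length(provinces)  # how many elements the vector holds: 3
names(provinces)   # the element names available for subsetting
provinces["QC"]                   # subset by name
provinces[c(TRUE, FALSE, TRUE)]   # subset by a logical vector
```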
# Build a character vector
char.vector <- c("Canada", "United States", "Great Britain")
char.vector
# subset by a single value
char.vector[2]
# subset by multiple values
char.vector[2:3]
# subset by removing values (cannot be mixed with positive values)
char.vector[c(-1, -3)]
# subset with repeating multiple values
char.vector[c(1, 2, 3, 3, 2, 1)]
# Build a character vector but include variable names
character.vector <- c(a = "Canada", b = "United States", c = "Great Britain")
character.vector
# subset by element name
character.vector[c("a", "b")]
# subset by a vector of logicals
character.vector[c(FALSE, TRUE, TRUE)]
character.vector[character.vector != "Canada"]
R will implicitly force (coerce) your vector to be of one data type, in this case the type that is most inclusive is a character vector. When we explicitly coerce a change from one data type to the next, it is known as casting. You can cast between certain data types and also object types.
Casting functions for data types include as.logical(), as.integer(), as.double(), as.numeric(), as.character(), and as.factor().
Casting functions for object types include as.data.frame(), as.list(), and as.matrix().
Importantly, when coercing, R converts from more specific to more general types, usually in this order: logical -> integer -> double -> character.
# Make a logical vector
logical.vector <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
str(logical.vector)
# Make a numeric vector
numeric.vector <- c(-1:10)
str(numeric.vector)
# Make a mixed vector. Take a note of the type
mixed.vector <- c(FALSE, TRUE, 1, 2, "three", 4, 5, "six")
str(mixed.vector)
logi [1:5] TRUE FALSE TRUE FALSE FALSE
int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
chr [1:8] "FALSE" "TRUE" "1" "2" "three" "4" "5" "six"
# Attempt to coerce our vectors
# logical to numeric
as.numeric(logical.vector)
# numeric to logical
as.logical(numeric.vector)
# numeric to character
as.character(numeric.vector)
# mixed to a numeric. Note what happens when elements cannot be converted
as.numeric(mixed.vector)
Warning message in eval(expr, envir, enclos): "NAs introduced by coercion"
Now that we have had the opportunity to create a few different vector objects, let's talk about what an object class is. An object class can be thought of as a structure with attributes that will behave a certain way when passed to a function. Because of this, the same function can respond differently depending on the class of object it receives.
Some R package developers have created their own object classes. For example, many of the functions in the tidyverse generate tibble objects. They behave in most ways like a data frame but have a more refined print structure, making it easier to see information such as column types when viewing them quickly. In general, from a troubleshooting standpoint, it is good to be aware that your data may need to be formatted to fit a certain class of object when using different packages.
After we are done tidying most of our datasets, they will be in tibble objects, but all of the basic data frame functions apply to these as well.
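A minimal sketch of the data frame / tibble relationship using as_tibble() from the tibble package (the example data is invented):

```r
library(tibble)

# The same data as a base data frame and as a tibble
df  <- data.frame(country = c("Canada", "France"), cases = c(10, 20))
tbl <- as_tibble(df)

class(df)   # "data.frame"
class(tbl)  # "tbl_df" "tbl" "data.frame" -- still a data frame underneath
tbl$cases * 2  # ordinary data frame operations carry over unchanged
```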
Whereas matrices are 2-dimensional structures limited to a single specific type of data within each instance, data frames are more complex, as each column of the structure can be treated like a vector. The data frame, however, can have multiple data types mixed across its different columns. Data frame rules to remember are:
Data frames allows us to generate tables of mixed information much like an Excel spreadsheet.
# Generate a data frame with different variable/column types
mixed.df <- data.frame(country = character.vector,
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3])
mixed.df
str(mixed.df)
|   | country | values | commonwealth |
|---|---|---|---|
|   | <chr> | <int> | <lgl> |
| a | Canada | 0 | TRUE |
| b | United States | 1 | FALSE |
| c | Great Britain | 2 | TRUE |
'data.frame': 3 obs. of 3 variables:
$ country     : chr "Canada" "United States" "Great Britain"
$ values      : int 0 1 2
$ commonwealth: logi TRUE FALSE TRUE
nrow(data_frame) # retrieve the number of rows in a data frame
ncol(data_frame) # retrieve the number of columns in a data frame
data_frame$column_name # Access a specific column by its name
data_frame[x,y] # Access a specific element located at row x, column y
rownames(data_frame) # retrieve or assign row names to your data frame
colnames(data_frame) # retrieve or assign columns names to your data frame
There are many more ways to access and manipulate data frames that we'll explore further down the road. Let's review some basic data frame code.
# query the dimensions of the data frame
dim(mixed.df)
nrow(mixed.df)
ncol(mixed.df)
# row and column names
rownames(mixed.df)
colnames(mixed.df)
# print the mixed data frame
mixed.df
# Access portions of the data frame
# a single column
str(mixed.df$country)
# a single element
mixed.df[2, 3]
mixed.df[3, "country"]
# multiple rows
mixed.df[c(1,3), ]
mixed.df[-2, ]
|   | country | values | commonwealth |
|---|---|---|---|
|   | <chr> | <int> | <lgl> |
| a | Canada | 0 | TRUE |
| b | United States | 1 | FALSE |
| c | Great Britain | 2 | TRUE |
chr [1:3] "Canada" "United States" "Great Britain"
|   | country | values | commonwealth |
|---|---|---|---|
|   | <chr> | <int> | <lgl> |
| a | Canada | 0 | TRUE |
| c | Great Britain | 2 | TRUE |
|   | country | values | commonwealth |
|---|---|---|---|
|   | <chr> | <int> | <lgl> |
| a | Canada | 0 | TRUE |
| c | Great Britain | 2 | TRUE |
Lists can hold mixed data types of different lengths. These are especially useful for bundling data of different types for passing around your scripts, to functions, or receiving output from functions! Rather than having to call multiple variables by name, you can store them in a single list!
If you forget what is in your list, use the str() function to check out its structure. It will tell you the number of items in your list and their data types.
# Make a named list of various items
mixed.list <- list(countries = character.vector, values = numeric.vector, mixed.data = mixed.df)
# Look at some information about our list
str(mixed.list)
names(mixed.list)
List of 3
$ countries : Named chr [1:3] "Canada" "United States" "Great Britain"
..- attr(*, "names")= chr [1:3] "a" "b" "c"
$ values    : int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
$ mixed.data:'data.frame': 3 obs. of 3 variables:
..$ country     : chr [1:3] "Canada" "United States" "Great Britain"
..$ values      : int [1:3] 0 1 2
..$ commonwealth: logi [1:3] TRUE FALSE TRUE
# Lists can often be unnamed
unnamed.list <- list(character.vector, values = numeric.vector, mixed.df)
# Look at some information about our unnamed list
str(unnamed.list)
names(unnamed.list)
List of 3
$       : Named chr [1:3] "Canada" "United States" "Great Britain"
..- attr(*, "names")= chr [1:3] "a" "b" "c"
$ values: int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
$       :'data.frame': 3 obs. of 3 variables:
..$ country     : chr [1:3] "Canada" "United States" "Great Britain"
..$ values      : int [1:3] 0 1 2
..$ commonwealth: logi [1:3] TRUE FALSE TRUE
Accessing lists is much like opening up a box of boxes of chocolates. You never know what you're gonna get when you forget the structure!
You can access elements with a mixture of number and naming annotations, much like data frames. Also, [[x]] is meant to access the xth "element" of the list. Note that unnamed lists cannot be accessed with naming annotations.
[x] returns a list object with your element(s) of choice in the list. [[x]] returns a single element only.

# Subset our list with []
str(mixed.list[c(1, 3, 2)])
str(mixed.list["values"])
List of 3
$ countries : Named chr [1:3] "Canada" "United States" "Great Britain"
..- attr(*, "names")= chr [1:3] "a" "b" "c"
$ mixed.data:'data.frame': 3 obs. of 3 variables:
..$ country     : chr [1:3] "Canada" "United States" "Great Britain"
..$ values      : int [1:3] 0 1 2
..$ commonwealth: logi [1:3] TRUE FALSE TRUE
$ values    : int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
List of 1
$ values: int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
# Pull out a single element
str(mixed.list[[2]])
mixed.list[["countries"]]
# Give a vector as input to [[]]
mixed.list[[c(1,3)]]
# vs equivalent
mixed.list[[1]][3]
# Access a single element from a data frame nested in a list
mixed.list[[c(3, 1, 1)]]
# vs equivalent
mixed.list[[3]][1, 1]
int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
Ah, the dreaded factors! A factor is a class of object used to encode a character vector into categories. They are used to store categorical variables and although it is tempting to think of them as character vectors this is a dangerous mistake. Adding or changing data in a data frame with pre-existing factors requires that you match factor levels correctly as well.
Factors make perfect sense if you are a statistician designing a programming language (!) but to everyone else they exist solely to torment us with confusing errors. At its core, a factor is really just an integer vector sitting beneath character labels, with an additional attribute, its levels (queried with levels()), which defines the accepted values for that variable.
Why not just use character vectors, you ask?
Believe it or not factors do have some useful properties. For example, factors allow you to specify all possible values a variable may take even if those values are not in your data set. Think of conditional formatting in Excel. We also use them heavily in generating statistical analyses and in grouping data when we want to visualize it.
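A small sketch of that "all possible values" property, using an invented severity variable whose middle level never appears in the data:

```r
# A factor whose accepted values exceed what was actually observed
sampled <- factor(c("mild", "mild", "severe"),
                  levels = c("mild", "moderate", "severe"))

levels(sampled)  # every accepted value, observed or not
table(sampled)   # counts include the unobserved "moderate" level as 0
typeof(sampled)  # "integer" -- the character labels sit on top of integer codes
```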
Since the inception of R, data.frame() calls have been used to create data frames, but the default behaviour was to convert strings (characters) to factors! This is a throwback to the purpose of R, which was to perform statistical analyses on datasets with methods like ANOVA (lecture 06!) which examine the relationships between variables (i.e. factors)!
As R has become more popular and its applications and packages have expanded, incoming users have been faced with remembering this obscure behaviour, leading to lost hours of debugging grief as they wondered why they couldn't pull information from their data frames to do a simple analysis on C. elegans strain abundance via molecular inversion probes in datasets of multiplexed populations. #SuspiciouslySpecific
That meant that users usually had to create data frames including the toggle
data.frame(name=character(), value=numeric(), stringsAsFactors = FALSE)
Fret no more! As of R 4.0.0 the default behaviour has switched and stringsAsFactors = FALSE is the default! Now if we want our characters to be factors, we must convert them explicitly, or turn this behaviour on at the outset of creating each data frame!
# Generate a data frame and include factors
str(data.frame(country = character.vector,
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"),
stringsAsFactors = TRUE)
)
'data.frame': 3 obs. of 4 variables:
$ country     : Factor w/ 3 levels "Canada","Great Britain",..: 1 3 2
$ values      : int 0 1 2
$ commonwealth: logi TRUE FALSE TRUE
$ continent   : Factor w/ 2 levels "Europe","North America": 2 2 1
# Explicitly define factors for each variable.
str(data.frame(country = factor(character.vector),
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"),
stringsAsFactors = FALSE)
)
'data.frame': 3 obs. of 4 variables:
$ country     : Factor w/ 3 levels "Canada","Great Britain",..: 1 3 2
$ values      : int 0 1 2
$ commonwealth: logi TRUE FALSE TRUE
$ continent   : chr "North America" "North America" "Europe"
You can specify which columns of strings are converted to factors at the time of declaring your column information. Alternatively you can coerce character vectors to factors after generating them.
R by default puts factor levels in alphabetical order. This can cause problems if we aren't aware of it. You can check the order of your factor levels with the levels() command. Furthermore you can specify, during factor creation, your level order.
Always check to make sure your factor levels are what you expect.
With factors, we can deal with our character levels directly, or their numeric equivalents.
# Generate a data frame and include factors
str(data.frame(country = character.vector,
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = factor(c("North America", "North America", "Europe"),
levels = c("North America", "Europe"))
)
)
'data.frame': 3 obs. of 4 variables:
$ country     : chr "Canada" "United States" "Great Britain"
$ values      : int 0 1 2
$ commonwealth: logi TRUE FALSE TRUE
$ continent   : Factor w/ 2 levels "North America",..: 1 1 2
# Coerce a factor
mixed.df <- data.frame(country = character.vector,
values = numeric.vector[2:4],
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"))
# Set our factor after declaring the data frame
mixed.df$continent <- factor(mixed.df$continent, levels=c("North America", "Europe"))
str(mixed.df)
'data.frame': 3 obs. of 4 variables:
$ country     : chr "Canada" "United States" "Great Britain"
$ values      : int 0 1 2
$ commonwealth: logi TRUE FALSE TRUE
$ continent   : Factor w/ 2 levels "North America",..: 1 1 2
Use levels() to list the levels and their order for your factor.
Use relevel() to change which level comes first.
Use ordered = TRUE to create an ordered factor.
Use labels = c() to relabel your levels. Note that level order is assigned before labels are added to your data. You are essentially labeling the integer assigned to your factor levels, so be careful when using this parameter!

Yes, you can treat data frames and arrays like large lists where mathematical operations can be applied to individual elements or to entire columns or more!
Therefore be careful to specify your numeric data for mathematical operations.
mixed.df
mixed.df$values + 3
mixed.df$values * 4
# implicit coercion of logical to integer
mixed.df$commonwealth * 5
# Perform math on a factor
mixed.df$continent * 6
as.numeric(mixed.df$continent) * 7
# Can we perform math on non-numeric variables?
mixed.df$country + 8
|   | country | values | commonwealth | continent |
|---|---|---|---|---|
|   | <chr> | <int> | <lgl> | <fct> |
| a | Canada | 0 | TRUE | North America |
| b | United States | 1 | FALSE | North America |
| c | Great Britain | 2 | TRUE | Europe |
Warning message in Ops.factor(mixed.df$continent, 6): "'*' not meaningful for factors"
Error in mixed.df$country + 8: non-numeric argument to binary operator Traceback:
The apply() family of functions to perform actions across data structures

The above are illustrative examples to see how our different data structures behave. In reality, you will want to do calculations across rows and columns, and not on your entire matrix or data frame.
The apply() function will recognize basic functions and use them on vectorized data

For example, we might have a count table where rows are genes, columns are samples, and we want to know the sum of all the counts for a gene. To do this, we can use the apply() function. apply() takes an array or matrix (or something that can be coerced to such, like a numeric data frame) and applies a function over rows (MARGIN = 1) or columns (MARGIN = 2). Here we can invoke the sum function.
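Before filling in the skeleton cell below, here is a quick self-contained illustration of what the two MARGIN values do, on a tiny made-up matrix rather than the class dataset:

```r
# A tiny 2 x 3 matrix just to show MARGIN
m <- matrix(1:6, nrow = 2)  # columns are (1,2), (3,4), (5,6)
m
apply(m, 1, sum)  # MARGIN = 1: sum across each row    -> 9 12
apply(m, 2, sum)  # MARGIN = 2: sum down each column   -> 3 7 11
```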
# Make a sample data frame of numeric values only
numeric.df <- data.frame(geneA = numeric.vector, geneB = numeric.vector*2, geneC = numeric.vector*3)
numeric.df
# Apply sum by columns
apply(numeric.df, ..., sum)
# Apply sum by rows
apply(numeric.df, ..., sum)
The apply() family

There are 3 additional members of the apply() family that perform similar functions with varying outputs:
lapply(data, FUN, ...) is usable on data frames, lists, and vectors. FUN will be applied to each element, and it returns a list as output.
sapply(data, FUN, ...) works similarly to lapply() except it tries to simplify the output to the most elementary data structure possible, i.e. it will return the simplest form of the data that makes sense as a representation.
mapply(FUN, data, ...) is short for "multivariate" apply and it applies a function to multiple lists or multiple vector arguments.

# Use lapply on the columns of numeric.df
...(numeric.df, sum)
str(lapply(numeric.df, sum))
# Use sapply on the columns of numeric.df
...(numeric.df, sum)
str(sapply(numeric.df, sum))
# Using lapply and sapply and sum on an actual list
sum.list <- list(...)
# lapply on the list
lapply(sum.list, sum)
# sapply on the list
sapply(sum.list, sum)
# Use lapply to select portions from a list
sum.list <- list(numeric.df, numeric.df)
# Extract the first row from each member of the list
lapply(sum.list, ...)
# Extract the 2nd column from each member of the list
lapply(sum.list, "[", , 2)
# Take a close look at what sapply returns in this case
sapply(sum.list, "[", , 2)
Notice how in using sapply() to extract from a list of data frames, a single matrix was returned - a single output in the simplest form that maintains structure.
# Use mapply in an example on numeric.vector
mapply(sum, numeric.vector, numeric.vector)
# Use mapply in an example on numeric.df
mapply(sum, numeric.df, numeric.df)
# Use mapply on the rep function to see its output
mapply(rep, c(...), 4)
Missing values in R are represented as NA (Not Available). Impossible values (like the result of 0/0) are represented by NaN (Not a Number). These types of values can be considered null values. Both, especially NA, have special ways to be dealt with, otherwise they may lead to errors in functions.
For our purposes, we are not interested in keeping NA data within our datasets so we will usually detect and remove them or replace them within our data after it is imported.
is.na() returns a logical vector reporting which values from your query are NA.
complete.cases() returns a logical vector indicating rows without any NA values.
Many functions can ignore NA values with the na.rm = TRUE parameter: e.g. mean(), sum(), etc.
The tidyr package can also be used to work with NA values.

# Add some NAs to our data frame
mixed.df <- data.frame(country = character.vector,
values = c(3, NA, 9),
commonwealth = logical.vector[1:3],
continent = c("North America", "North America", "Europe"),
measure = c("metric", NA, "metric")
)
# Look at our updated data frame
mixed.df
# Which entries are NA?
is.na(mixed.df)
# Which rows are incomplete?
complete.cases(mixed.df)
# Use some math functions
sum(mixed.df$values, na.rm = TRUE)
|   | country | values | commonwealth | continent | measure |
|---|---|---|---|---|---|
|   | <chr> | <dbl> | <lgl> | <chr> | <chr> |
| a | Canada | 3 | TRUE | North America | metric |
| b | United States | NA | FALSE | North America | NA |
| c | Great Britain | 9 | TRUE | Europe | metric |
|   | country | values | commonwealth | continent | measure |
|---|---|---|---|---|---|
| a | FALSE | FALSE | FALSE | FALSE | FALSE |
| b | FALSE | TRUE | FALSE | FALSE | TRUE |
| c | FALSE | FALSE | FALSE | FALSE | FALSE |
tidyverse¶Let's begin with some definitions:
In data science, long format is generally preferred over wide format because it allows for easier and more efficient subsetting and manipulation of the data.
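To make the two layouts concrete, here is the same (made-up) measurement table in both formats:

```r
# Wide format: one row per country, one column per year
wide.df <- data.frame(country = c("Canada", "France"),
                      y2020   = c(10, 20),
                      y2021   = c(30, 40))

# Long format: one row per country-year observation
long.df <- data.frame(country = rep(c("Canada", "France"), each = 2),
                      year    = rep(c("2020", "2021"), times = 2),
                      cases   = c(10, 30, 20, 40))

# Same information, but the long layout has one observation per row,
# which makes it easy to subset (e.g. all 2021 rows) or group by year
long.df[long.df$year == "2021", ]
```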
Why tidy data?
Data cleaning (or dealing with 'messy' data) accounts for a huge chunk of a data scientist's time. Ultimately, we want to get our data into a 'tidy' (long) format where it is easy to manipulate, model, and visualize. Having a consistent data structure, and tools that work with that structure, helps this process along.
In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each value has its own cell.

This seems pretty straightforward, and it is. It is the datasets you get that will not be straightforward. Having a map of where to take your data is helpful for unraveling its structure and getting it into a usable format.
readr package - "All roads lead to Rome.."¶... but not all roads are easy to travel.
Depending on format, data files can be opened in a number of ways. The simplest methods we will use involve the readr package as part of the tidyverse. These functions have already been developed to simplify the import process for users. The functions we will use most often are:
- read_delim(), read_csv(), read_tsv(), read_csv2() [European datasets]
- read_lines()

Let's read in our first dataset so that we can convert from wide to long format.
# Use read_csv to look at our PHU daily case data
covid_phu.df <- read_csv("./data/Ontario_daily_change_in_cases_by_phu.csv")
-- Column specification -------------------------------------------------------- cols( .default = col_double(), Date = col_date(format = "") ) i Use `spec()` for the full column specifications.
# Check the structure and characteristics of covid_phu
str(covid_phu.df)
tail(covid_phu.df)
tibble [354 x 36] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ Date : Date[1:354], format: "2020-03-24" "2020-03-25" ... $ Algoma_Public_Health_Unit : num [1:354] NA 0 0 0 NA NA 3 0 1 0 ... $ Brant_County_Health_Unit : num [1:354] NA 1 0 0 NA NA 9 3 1 5 ... $ Chatham-Kent_Health_Unit : num [1:354] NA 0 0 0 NA NA 3 0 2 2 ... $ Durham_Region_Health_Department : num [1:354] NA 3 1 5 NA NA 56 21 24 25 ... $ Eastern_Ontario_Health_Unit : num [1:354] NA 0 0 0 NA NA 5 1 8 6 ... $ Grey_Bruce_Health_Unit : num [1:354] NA 1 0 1 NA NA 5 1 1 1 ... $ Haldimand-Norfolk_Health_Unit : num [1:354] NA 0 0 0 NA NA 3 4 15 10 ... $ Haliburton,_Kawartha,_Pine_Ridge_District_Health_Unit : num [1:354] NA 0 1 14 NA NA 12 10 8 9 ... $ Halton_Region_Health_Department : num [1:354] NA 1 4 1 NA NA 8 7 27 18 ... $ Hamilton_Public_Health_Services : num [1:354] NA 3 4 1 NA NA 38 17 7 20 ... $ Hastings_and_Prince_Edward_Counties_Health_Unit : num [1:354] NA 0 2 0 NA NA 3 0 1 4 ... $ Huron_Perth_District_Health_Unit : num [1:354] NA 0 0 0 NA NA 5 1 3 3 ... $ Kingston,_Frontenac_and_Lennox_&_Addington_Public_Health: num [1:354] NA 3 5 0 NA NA 12 8 11 4 ... $ Lambton_Public_Health : num [1:354] NA 0 0 5 NA NA 13 9 10 17 ... $ Leeds,_Grenville_and_Lanark_District_Health_Unit : num [1:354] NA 0 0 0 NA NA 8 8 8 7 ... $ Middlesex-London_Health_Unit : num [1:354] NA 0 2 4 NA NA 20 8 8 22 ... $ Niagara_Region_Public_Health_Department : num [1:354] NA 1 1 2 NA NA 22 10 8 18 ... $ North_Bay_Parry_Sound_District_Health_Unit : num [1:354] NA 0 0 1 NA NA 3 0 0 0 ... $ Northwestern_Health_Unit : num [1:354] NA 0 0 1 NA NA 1 0 0 1 ... $ Ottawa_Public_Health : num [1:354] NA 3 0 5 NA NA 52 9 37 118 ... $ Peel_Public_Health : num [1:354] NA 3 13 15 NA NA 95 42 21 114 ... $ Peterborough_Public_Health : num [1:354] NA 0 2 3 NA NA 18 2 1 9 ... $ Porcupine_Health_Unit : num [1:354] NA 0 3 0 NA NA 6 0 5 3 ... $ Region_of_Waterloo,_Public_Health : num [1:354] NA 2 0 3 NA NA 60 5 13 13 ... 
$ Renfrew_County_and_District_Health_Unit : num [1:354] NA 0 0 0 NA NA 1 1 2 3 ... $ Simcoe_Muskoka_District_Health_Unit : num [1:354] NA 0 1 4 NA NA 25 9 5 8 ... $ Southwestern_Public_Health : num [1:354] NA 0 0 2 NA NA 3 2 2 1 ... $ Sudbury_&_District_Health_Unit : num [1:354] NA 1 0 1 NA NA 1 2 2 3 ... $ Thunder_Bay_District_Health_Unit : num [1:354] NA 0 0 0 NA NA 2 1 1 0 ... $ Timiskaming_Health_Unit : num [1:354] NA 0 1 0 NA NA 0 1 0 0 ... $ Toronto_Public_Health : num [1:354] NA 17 21 22 NA NA 197 4 32 282 ... $ Wellington-Dufferin-Guelph_Public_Health : num [1:354] NA 1 1 0 NA NA 4 18 14 6 ... $ Windsor-Essex_County_Health_Unit : num [1:354] NA 1 2 0 NA NA 20 0 10 37 ... $ York_Region_Public_Health_Services : num [1:354] NA 5 5 34 NA NA 94 16 25 109 ... $ Total : num [1:354] 0 46 69 124 0 0 807 220 313 878 ... - attr(*, "spec")= .. cols( .. Date = col_date(format = ""), .. Algoma_Public_Health_Unit = col_double(), .. Brant_County_Health_Unit = col_double(), .. `Chatham-Kent_Health_Unit` = col_double(), .. Durham_Region_Health_Department = col_double(), .. Eastern_Ontario_Health_Unit = col_double(), .. Grey_Bruce_Health_Unit = col_double(), .. `Haldimand-Norfolk_Health_Unit` = col_double(), .. `Haliburton,_Kawartha,_Pine_Ridge_District_Health_Unit` = col_double(), .. Halton_Region_Health_Department = col_double(), .. Hamilton_Public_Health_Services = col_double(), .. Hastings_and_Prince_Edward_Counties_Health_Unit = col_double(), .. Huron_Perth_District_Health_Unit = col_double(), .. `Kingston,_Frontenac_and_Lennox_&_Addington_Public_Health` = col_double(), .. Lambton_Public_Health = col_double(), .. `Leeds,_Grenville_and_Lanark_District_Health_Unit` = col_double(), .. `Middlesex-London_Health_Unit` = col_double(), .. Niagara_Region_Public_Health_Department = col_double(), .. North_Bay_Parry_Sound_District_Health_Unit = col_double(), .. Northwestern_Health_Unit = col_double(), .. Ottawa_Public_Health = col_double(), .. Peel_Public_Health = col_double(), .. 
Peterborough_Public_Health = col_double(), .. Porcupine_Health_Unit = col_double(), .. `Region_of_Waterloo,_Public_Health` = col_double(), .. Renfrew_County_and_District_Health_Unit = col_double(), .. Simcoe_Muskoka_District_Health_Unit = col_double(), .. Southwestern_Public_Health = col_double(), .. `Sudbury_&_District_Health_Unit` = col_double(), .. Thunder_Bay_District_Health_Unit = col_double(), .. Timiskaming_Health_Unit = col_double(), .. Toronto_Public_Health = col_double(), .. `Wellington-Dufferin-Guelph_Public_Health` = col_double(), .. `Windsor-Essex_County_Health_Unit` = col_double(), .. York_Region_Public_Health_Services = col_double(), .. Total = col_double() .. )
| Date | Algoma_Public_Health_Unit | Brant_County_Health_Unit | Chatham-Kent_Health_Unit | Durham_Region_Health_Department | Eastern_Ontario_Health_Unit | Grey_Bruce_Health_Unit | Haldimand-Norfolk_Health_Unit | Haliburton,_Kawartha,_Pine_Ridge_District_Health_Unit | Halton_Region_Health_Department | ... | Simcoe_Muskoka_District_Health_Unit | Southwestern_Public_Health | Sudbury_&_District_Health_Unit | Thunder_Bay_District_Health_Unit | Timiskaming_Health_Unit | Toronto_Public_Health | Wellington-Dufferin-Guelph_Public_Health | Windsor-Essex_County_Health_Unit | York_Region_Public_Health_Services | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <date> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 2021-03-07 | 0 | 12 | 5 | 58 | 9 | 2 | 12 | 5 | 39 | ... | 36 | 3 | 34 | 53 | 3 | 329 | 38 | 32 | 116 | 1299 |
| 2021-03-08 | 0 | 20 | 5 | 68 | 15 | 0 | 4 | 4 | 51 | ... | 48 | 4 | 27 | 91 | 1 | 568 | 10 | 46 | 119 | 1631 |
| 2021-03-09 | 0 | 6 | 11 | 25 | 10 | 3 | 3 | 1 | 48 | ... | 30 | 5 | 24 | 39 | 1 | 343 | 10 | 30 | 105 | 1185 |
| 2021-03-10 | 0 | 14 | 9 | 48 | 11 | 1 | 6 | 4 | 48 | ... | 31 | 7 | 13 | 67 | 0 | 428 | 8 | 23 | 149 | 1316 |
| 2021-03-11 | 0 | 7 | 10 | 36 | 18 | 5 | 2 | 5 | 33 | ... | 43 | 6 | 11 | 48 | 0 | 294 | 3 | 39 | 79 | 1092 |
| 2021-03-12 | 1 | 11 | 10 | 35 | 12 | -1 | 6 | 4 | 34 | ... | 43 | 3 | 37 | 52 | 0 | 371 | 19 | 39 | 111 | 1371 |
From looking at our public health unit data, we can see that tracking begins on 2020-03-24 and runs until 2021-03-12. In total there are observations for 354 days across 34 public health units. The final column appears to be a running tally of the total cases reported on each date.
From the outset, we can see there are some issues with the data set that we'll want to resolve and we'll work through some tidyverse functions in order to do that. First let's quickly review some of the potential problems with our dataset.
- The data is in wide format, so we will collapse the per-PHU columns into a single new_cases variable for each Date observation. At the same time, we will not collapse Total into that same variable.
- The data contains NA values. Many instances are likely due to no data being collected on those dates. For our purposes, it may be simpler to replace them with a value of 0.

Before we tackle these issues, let's go ahead and review some of the tools at our disposal.
The tidyverse package and its contents make manipulating data easier¶While the tidyverse is composed of multiple packages, we will focus on working with a subset of these: dplyr, tidyr, and stringr.
Use %>% whenever you can!¶To save on making extra variables in memory and to help keep our code concise, we should make use of the %>% symbol. This is a redirection or pipe symbol, similar to | in Unix operating systems, and is used for redirecting output from one function to the input of another. By thoughtfully combining it with other commands, we can alter or query our datasets with ease.
We'll also introduce the %<>% operator in this class. It is a little more advanced, but it allows us to assign the final product of our chain of commands back to the very first object.
Whenever we are redirecting, we are implicitly passing our output to the first parameter of the next function. We may not always want to use the entirety of the output or we may want to also reuse that redirected output as part of another parameter. To do so we can use . to explicitly denote the redirected output.
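A minimal sketch of all three ideas (a plain %>% chain, the . placeholder, and %<>%), assuming the magrittr package that supplies these operators is loaded:

```r
library(magrittr)

# x %>% f() is f(x); chains read left to right
c(4, 9, 16) %>% sqrt() %>% sum()   # sum(sqrt(c(4, 9, 16))) is 9

# . places the piped value somewhere other than the first parameter
10 %>% seq(2, .)                   # seq(2, 10)

# %<>% pipes and then assigns the result back to the starting object
x <- c(4, 9, 16)
x %<>% sqrt()                      # x is now c(2, 3, 4)
```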
dplyr has functions for accessing and altering your data¶We will often use the "verbs" of the dplyr package to massage the look of our data by changing column names or subsetting it. The most common verbs you will see in this course are:
- arrange()
- count(), tally()
- distinct()
- filter()
- mutate(), transmute()
- select()
- summarize() or summarise()
- group_by(), reversed by ungroup()
- rename() and relocate()

tidyr has additional functions for reshaping our data¶The tidyr package will be most useful when we are trying to reshape our data from wide to long format or vice versa, i.e. when we want to drastically alter portions or all of our data.
- pivot_longer(), previously gather()
- pivot_wider(), previously spread()
- extract()
- separate()
- unite()
- drop_na()
- replace_na()

stringr provides functionality for searching data based on regular expressions¶The stringr package will come in most useful when we are trying to fix string issues with our data. Many times our headers or data will contain spaces or poor formatting. We will often prefer to have our headers in lower case, with any spaces replaced by an _. We'll also use verbs from this package to make variables or data more concise.
- str_count()
- str_detect()
- str_extract() and str_extract_all()
- str_match() and str_match_all()
- str_remove() and str_remove_all()
- str_split(), str_split_fixed(), and str_split_n()
- str_subset() and str_which()

stringr helper functions
- str_to_upper() and str_to_lower()
- str_c()
- str_flatten()
- str_sub()

pivot_longer()¶As you may recall, our PHU data is formatted such that each column represents new cases per day for a single PHU. It's a great format for data entry and certainly reduces redundancy. However, for us to work with this data, we want to collapse all of those PHU columns into a single column.
Previously you may have used gather() from the tidyr package to melt wide data into a long format. Today we will use an actively developed version of this function called pivot_longer() which, for our purposes, will rely on four parameters:
- data: the data frame (and columns) that we wish to transform.
- cols: the columns that we wish to gather/collapse into a long format.
- names_to: the variable name of the new column to hold the collapsed information from our current columns.
- values_to: the variable name of the values for each observation that we are collapsing down.

We'll be using a series of %>% so for now we won't save our work to a new object.
# Start with our wide-format phu data
covid_phu.df %>%
# Pivot the data into a long-format set
pivot_longer(cols= c(2:35), names_to = "public_health_unit", values_to = "new_cases") %>%
# Just take a quick look at the output.
str()
tibble [12,036 x 4] (S3: tbl_df/tbl/data.frame) $ Date : Date[1:12036], format: "2020-03-24" "2020-03-24" ... $ Total : num [1:12036] 0 0 0 0 0 0 0 0 0 0 ... $ public_health_unit: chr [1:12036] "Algoma_Public_Health_Unit" "Brant_County_Health_Unit" "Chatham-Kent_Health_Unit" "Durham_Region_Health_Department" ... $ new_cases : num [1:12036] NA NA NA NA NA NA NA NA NA NA ...
Replacing NA values in our data with replace_na()¶Our conversion to long format creates 12,036 observations relating a Date to a new_cases value in a specific public health unit (plus the running Total). From the looks of our data, however, we have a number of NA values under our new_cases variable.
We have two options:
- Remove the NA observations from our data set. There won't be any loss of information, since we could rebuild the original data if we really needed to.
- Replace the NA observations with a value that makes sense for our analysis.

Let's replace the missing observations with a new value, 0, using replace_na(). This function will need two parameters:
- data: the data frame or vector that it will scan for NA values.
- replace: the value that we will use to replace NA.

We're going to update our pipe of commands and save the final output into a new variable, covid_phu_long.df.
# Pivot the data into a long-format set and remove NAs from the value table
covid_phu_long.df <- covid_phu.df %>%
pivot_longer(cols = c(2:35), names_to = "public_health_unit", values_to = "new_cases") %>%
# Change the values of "new_cases" using the mutate function
mutate(new_cases = replace_na(data = .$new_cases, replace = 0))
# Check that we have covered all of the NA values in our data frame by looking for complete cases
nrow(covid_phu_long.df[complete.cases(covid_phu_long.df),])
# Take a look at the Public Health Unit names
print(unique(covid_phu_long.df$public_health_unit))
[1] "Algoma_Public_Health_Unit" [2] "Brant_County_Health_Unit" [3] "Chatham-Kent_Health_Unit" [4] "Durham_Region_Health_Department" [5] "Eastern_Ontario_Health_Unit" [6] "Grey_Bruce_Health_Unit" [7] "Haldimand-Norfolk_Health_Unit" [8] "Haliburton,_Kawartha,_Pine_Ridge_District_Health_Unit" [9] "Halton_Region_Health_Department" [10] "Hamilton_Public_Health_Services" [11] "Hastings_and_Prince_Edward_Counties_Health_Unit" [12] "Huron_Perth_District_Health_Unit" [13] "Kingston,_Frontenac_and_Lennox_&_Addington_Public_Health" [14] "Lambton_Public_Health" [15] "Leeds,_Grenville_and_Lanark_District_Health_Unit" [16] "Middlesex-London_Health_Unit" [17] "Niagara_Region_Public_Health_Department" [18] "North_Bay_Parry_Sound_District_Health_Unit" [19] "Northwestern_Health_Unit" [20] "Ottawa_Public_Health" [21] "Peel_Public_Health" [22] "Peterborough_Public_Health" [23] "Porcupine_Health_Unit" [24] "Region_of_Waterloo,_Public_Health" [25] "Renfrew_County_and_District_Health_Unit" [26] "Simcoe_Muskoka_District_Health_Unit" [27] "Southwestern_Public_Health" [28] "Sudbury_&_District_Health_Unit" [29] "Thunder_Bay_District_Health_Unit" [30] "Timiskaming_Health_Unit" [31] "Toronto_Public_Health" [32] "Wellington-Dufferin-Guelph_Public_Health" [33] "Windsor-Essex_County_Health_Unit" [34] "York_Region_Public_Health_Services"
str_replace_all()¶Looking at our PHU names, we can see that there is a lot of redundancy: they end in some form of "_Health_Unit", "_Public_Health", "_Health_Department", or "_Public_Health_Services".
We have a couple of choices: we can use either str_replace_all() or a specialized version of it, str_remove_all(), which simply replaces a pattern with an empty string.
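The equivalence is easy to demonstrate on one of our PHU names (a single string used here purely for illustration):

```r
library(stringr)

phu <- "Algoma_Public_Health_Unit"

# replacing a pattern with "" and removing it are the same operation
str_replace_all(phu, "_Public_Health_Unit", "")   # "Algoma"
str_remove_all(phu, "_Public_Health_Unit")        # "Algoma"
```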
For str_replace_all() we will supply:
- string: a single string or vector of strings.
- pattern: the pattern we wish to search for, in the form of a string or regular expression.
- replacement: the replacement string we wish to use.

We also see the odd "," here, but we'll actually perform a second replacement on the updated strings to convert all of the underscores to spaces. To wrap up, we'll convert our updated variable to a factor and overwrite our original covid_phu_long.df.
We will accomplish this all through multiple calls to mutate.
# Clean up the Public Health Unit names
covid_phu_long.df %<>%
# Replace our public_health_unit values with ones where we remove excess verbiage
mutate(public_health_unit = str_replace_all(string = .$public_health_unit,
pattern = c("(_Public\\w*)|(,_Public\\w*)|(_Health\\w*)"),
replacement = "")) %>%
# From the updated version of public_health_unit, replace all _ with " "
mutate(public_health_unit = str_replace_all(string = .$public_health_unit,
pattern = "_",
replacement = " ")) %>%
# Now make sure that it's a factor for later
mutate(public_health_unit = as.factor(public_health_unit))
# Take a look at the new set of phu names
print(levels(covid_phu_long.df$public_health_unit))
[1] "Algoma" [2] "Brant County" [3] "Chatham-Kent" [4] "Durham Region" [5] "Eastern Ontario" [6] "Grey Bruce" [7] "Haldimand-Norfolk" [8] "Haliburton, Kawartha, Pine Ridge District" [9] "Halton Region" [10] "Hamilton" [11] "Hastings and Prince Edward Counties" [12] "Huron Perth District" [13] "Kingston, Frontenac and Lennox & Addington" [14] "Lambton" [15] "Leeds, Grenville and Lanark District" [16] "Middlesex-London" [17] "Niagara Region" [18] "North Bay Parry Sound District" [19] "Northwestern" [20] "Ottawa" [21] "Peel" [22] "Peterborough" [23] "Porcupine" [24] "Region of Waterloo" [25] "Renfrew County and District" [26] "Simcoe Muskoka District" [27] "Southwestern" [28] "Sudbury & District" [29] "Thunder Bay District" [30] "Timiskaming" [31] "Toronto" [32] "Wellington-Dufferin-Guelph" [33] "Windsor-Essex County" [34] "York Region"
# Take a quick look at our final dataset
head(covid_phu_long.df)
| Date | Total | public_health_unit | new_cases |
|---|---|---|---|
| <date> | <dbl> | <fct> | <dbl> |
| 2020-03-24 | 0 | Algoma | 0 |
| 2020-03-24 | 0 | Brant County | 0 |
| 2020-03-24 | 0 | Chatham-Kent | 0 |
| 2020-03-24 | 0 | Durham Region | 0 |
| 2020-03-24 | 0 | Eastern Ontario | 0 |
| 2020-03-24 | 0 | Grey Bruce | 0 |
rename() variables for clarity¶Now that we have the basic structure for our data, we want to clean it up just a little bit by renaming our Total column to clarify that it represents total new cases across all PHUs for that date. Why did we keep this column separate? Now we can use this information to generate percentage totals for each PHU if we choose to. We'll also change our Date column to lower case at the same time.
We'll use rename() from dplyr to accomplish the task of renaming our column. There are a number of ways you could accomplish this without using dplyr but the simplicity of it is nice.
# Rename our Total column to clarify its meaning
covid_phu_long.df %>%
rename(total_phu_new = Total,
date = Date) %>%
head()
| date | total_phu_new | public_health_unit | new_cases |
|---|---|---|---|
| <date> | <dbl> | <fct> | <dbl> |
| 2020-03-24 | 0 | Algoma | 0 |
| 2020-03-24 | 0 | Brant County | 0 |
| 2020-03-24 | 0 | Chatham-Kent | 0 |
| 2020-03-24 | 0 | Durham Region | 0 |
| 2020-03-24 | 0 | Eastern Ontario | 0 |
| 2020-03-24 | 0 | Grey Bruce | 0 |
relocate()¶The last cleanup we can accomplish with our data is to move total_phu_new to the last column of our data frame. This is for personal preference but also makes more sense when simply looking at the data. The relocate() verb from dplyr accomplishes this with ease since we are not dropping or removing columns. It uses some extra syntax to help accomplish its functions:
- .data: the data frame or tibble we want to alter.
- ...: the columns we wish to move.
- .before or .after: determines the destination of the columns. Supplying neither will move the columns to the left-hand side.

In fact, relocate() can be used to rename a column as well, but the renamed column will also be moved by default, so consider the ramifications of such an action!
Note: We could accomplish a similar result using the select command as well. It's really up to what you're comfortable with but it is much simpler to use relocate() when you are working with a large number of columns and you want to move one to a specific location.
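Both behaviours are easy to see on a tiny made-up frame, assuming dplyr is loaded: a plain move, and the rename-while-moving side effect noted above.

```r
library(dplyr)

toy.df <- data.frame(a = 1, b = 2, c = 3)

# Move column c after column a
relocate(toy.df, c, .after = a)       # column order: a, c, b

# Naming the column renames it as it moves
relocate(toy.df, z = c, .after = a)   # column order: a, z, b
```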
# Rename our Total column to clarify its meaning
covid_phu_long.df %>%
rename(total_phu_new = Total,
date = Date) %>%
# relocate our total column to the right side
relocate(total_phu_new, .after = new_cases) %>% head()
| date | public_health_unit | new_cases | total_phu_new |
|---|---|---|---|
| <date> | <fct> | <dbl> | <dbl> |
| 2020-03-24 | Algoma | 0 | 0 |
| 2020-03-24 | Brant County | 0 | 0 |
| 2020-03-24 | Chatham-Kent | 0 | 0 |
| 2020-03-24 | Durham Region | 0 | 0 |
| 2020-03-24 | Eastern Ontario | 0 | 0 |
| 2020-03-24 | Grey Bruce | 0 | 0 |
# Relocate our target column using the select() command
covid_phu_long.df %<>%
rename(total_phu_new = Total,
date = Date) %>%
# relocate our total column to the right side
select(1, 3, 4, 2)
head(covid_phu_long.df)
| date | public_health_unit | new_cases | total_phu_new |
|---|---|---|---|
| <date> | <fct> | <dbl> | <dbl> |
| 2020-03-24 | Algoma | 0 | 0 |
| 2020-03-24 | Brant County | 0 | 0 |
| 2020-03-24 | Chatham-Kent | 0 | 0 |
| 2020-03-24 | Durham Region | 0 | 0 |
| 2020-03-24 | Eastern Ontario | 0 | 0 |
| 2020-03-24 | Grey Bruce | 0 | 0 |
At this point we have completed the data wrangling we want to accomplish on this dataset. We've converted it to a long format and renamed the PHU entries while removing any NA values that may cause issues. There are a number of ways we could save this data now, either as a text file or in its current form as a data frame in a .RData format.
- write_delim(), write_csv(), write_tsv(), write_excel_csv()
- write_lines()
- save()
- load()

Let's try some of those methods now.
# Check the file names we currently have
print(dir("./data/"))
[1] "Ontario_covidtesting.csv" [2] "Ontario_daily_change_in_cases_by_phu.csv" [3] "Ontario_daily_change_in_cases_by_phu_long.RData" [4] "Ontario_daily_change_in_cases_by_phu_long.tsv"
# Write covid_phu_long.df to a tab-delimited file
write_tsv(covid_phu_long.df, file = "./data/Ontario_daily_change_in_cases_by_phu_long.tsv")
# Check our file names after writing
print(dir("./data/"))
[1] "Ontario_covidtesting.csv" [2] "Ontario_daily_change_in_cases_by_phu.csv" [3] "Ontario_daily_change_in_cases_by_phu_long.RData" [4] "Ontario_daily_change_in_cases_by_phu_long.tsv"
# Save our data frame as an object
save(covid_phu_long.df, file="./data/Ontario_daily_change_in_cases_by_phu_long.RData")
# Check our file names after saving
print(dir("./data/"))
[1] "Ontario_covidtesting.csv" [2] "Ontario_daily_change_in_cases_by_phu.csv" [3] "Ontario_daily_change_in_cases_by_phu_long.RData" [4] "Ontario_daily_change_in_cases_by_phu_long.tsv"
readxl and writexl packages for working with Excel spreadsheets¶Not all of your data may come in a comma- or tab-delimited format. In the case of Excel spreadsheets, there are some packages available that can facilitate the parsing of these more complex files. The readxl package is part of the tidyverse, but the writexl package is not. There are other means of writing to an Excel file format, but they are dependent on other programs (like Java or Excel) or their drivers.
From the readxl package
- excel_sheets()
- read_excel()

From the writexl package (not a part of the tidyverse, but independent of Java and Excel):
- write_xlsx()

ggplot2¶We now have some data in a tidy format that we'd like to visualize. We can begin with some initial analyses of the data using the ggplot2 package, which has all of the components we need to help us decide which data we want to focus on or keep. There are a number of ways to visualize our data, and here we will refresh our ggplot skills.
Basic ggplot notes:
- ggplot objects hold a complex number of attributes but always need an initial source of data.
- ggplot objects can be modified with the + symbol by adding in layers.
- ggplot objects can be plotted, saved, and passed around.

# Adjust our plot window sizes to go a little wider
options(repr.plot.width=21, repr.plot.height=7)
# Initialize a plot with our data
phu.plot <- ggplot(covid_phu_long.df)
# Take a quick look at the structure of the data
str(phu.plot)
List of 9
$ data : tibble [12,036 x 4] (S3: tbl_df/tbl/data.frame)
..$ date : Date[1:12036], format: "2020-03-24" "2020-03-24" ...
..$ public_health_unit: Factor w/ 34 levels "Algoma","Brant County",..: 1 2 3 4 5 6 7 8 9 10 ...
..$ new_cases : num [1:12036] 0 0 0 0 0 0 0 0 0 0 ...
..$ total_phu_new : num [1:12036] 0 0 0 0 0 0 0 0 0 0 ...
$ layers : list()
$ scales :Classes 'ScalesList', 'ggproto', 'gg' <ggproto object: Class ScalesList, gg>
add: function
clone: function
find: function
get_scales: function
has_scale: function
input: function
n: function
non_position_scales: function
scales: NULL
super: <ggproto object: Class ScalesList, gg>
$ mapping : Named list()
..- attr(*, "class")= chr "uneval"
$ theme : list()
$ coordinates:Classes 'CoordCartesian', 'Coord', 'ggproto', 'gg' <ggproto object: Class CoordCartesian, Coord, gg>
aspect: function
backtransform_range: function
clip: on
default: TRUE
distance: function
expand: TRUE
is_free: function
is_linear: function
labels: function
limits: list
modify_scales: function
range: function
render_axis_h: function
render_axis_v: function
render_bg: function
render_fg: function
setup_data: function
setup_layout: function
setup_panel_guides: function
setup_panel_params: function
setup_params: function
train_panel_guides: function
transform: function
super: <ggproto object: Class CoordCartesian, Coord, gg>
$ facet :Classes 'FacetNull', 'Facet', 'ggproto', 'gg' <ggproto object: Class FacetNull, Facet, gg>
compute_layout: function
draw_back: function
draw_front: function
draw_labels: function
draw_panels: function
finish_data: function
init_scales: function
map_data: function
params: list
setup_data: function
setup_params: function
shrink: TRUE
train_scales: function
vars: function
super: <ggproto object: Class FacetNull, Facet, gg>
$ plot_env :<environment: R_GlobalEnv>
$ labels : Named list()
- attr(*, "class")= chr [1:2] "gg" "ggplot"
We now have a basic plot object initialized but we need to tell it how to display the data associated with it. We'll begin with a simple line graph of all the public health units across all dates within the set.
In order to update or add layers to a ggplot object, we use the + symbol for each command. For instance, to define the source of x-axis and y-axis data, we use the aes() command to update the aesthetics layer. Remember how we defined the public_health_unit variable as a factor? We'll take advantage of that here and tell ggplot to give each PHU its own colour.
After defining our aesthetics, we still need to tell ggplot how to actually graph the data. The ggplot package comes with an abundance of visualizations accessed through the geom_*() commands. Some examples include
- geom_point() for scatterplots
- geom_line() for line graphs
- geom_boxplot() for boxplots
- geom_violin() for violin plots
- geom_bar() for bargraphs
- geom_histogram() for histograms

# Update the aesthetics with axis and colour information, then add a line graph!
phu.plot +
aes(x = date, y = new_cases, colour = public_health_unit) +
geom_line() +
theme(text = element_text(size = 20)) + # set text size
guides(colour = guide_legend(title="Public Health Unit")) + # Legend title
xlab("Date") + # Set the x-axis label
ylab("New cases") # Set the y-axis label
facet_wrap() command to break PHUs into separate graphs¶There's a lot of data on that graph, and some of it is quite drowned out by the scale of the PHUs with many more cases. To break out each PHU individually, we can add the facet_wrap() command. We'll also update some of its parameters:
- scales: we will update this so each y-axis scale is determined by PHU-specific data.
- ncol: use this to set the number of columns displayed in our grid.

At the same time, we'll also get rid of the legend since each individual graph will be labeled by its PHU.
# This is going to be a big graph so adjust our plot window sizes for us
options(repr.plot.width=20, repr.plot.height=30)
# Add a facet_wrap and get rid of the legend
phu_facet.plot <- phu.plot +
aes(x = date, y = new_cases, colour = public_health_unit) +
geom_line() +
theme(legend.position = "none") +
theme(text = element_text(size = 20)) + # set text size
xlab("Date") + # Set the x-axis label
ylab("New cases") + # Set the y-axis label
ggtitle("New cases per day across Ontario Public Health Units") +
# Facet our data by PHU
facet_wrap(~ public_health_unit, scales = "free_y", ncol = 4)
# Display our plot
phu_facet.plot
ggsave() command to save your plots to a file¶There are a number of ways you can use the ggsave() command to specify how you want to save your files.
# What is our working directory?
getwd()
# Save the plot we've generated to the root directory of the lecture files.
ggsave(plot = phu_facet.plot, filename = "Ontario_phu_data.all.facet.png", scale=2, device = "png", units = c("cm"))
# Take a look at the directory
dir()
Saving 33.9 x 33.9 cm image
Although we do have a running total for each date, what if we want to look at total cases across subsets of the PHUs? Using a barplot, we can stack cases by date and get a sense of daily case totals for whichever sets of PHUs we desire.
This time we will use geom_bar() to display our data and tell it to use the values from our new_cases variable to generate the totals. We do this by setting the stat = "identity" parameter.
At the same time, let's update our colours to use a colour-blind friendly palette scheme.
# This is going to be a simpler graph so adjust our plot window size accordingly
options(repr.plot.width=20, repr.plot.height=10)
phu.plot +
aes(x = date, y= new_cases, fill = public_health_unit) + # set our fill colour instead of line colour
theme(text = element_text(size = 20)) + # set text size
guides(fill = guide_legend(title="Public Health Unit")) +
xlab("Date") + # Set the x-axis label
ylab("New cases") + # Set the y-axis label
ggtitle("New cases per day across all Ontario Public Health Units") +
# Set up our barplot here
geom_bar(stat = "identity") +
scale_fill_viridis_d() # the "d" stands for discrete colour scale
From above we get a sense of overall totals for some PHU distributions but it's still too much to look at. Let's transform our x-axis values so we can bin by months instead. To accomplish this we'll use the as.yearmon() function found in the zoo package we loaded at the beginning of the lecture.
phu.plot +
aes(x = as.yearmon(date), y= new_cases, fill = public_health_unit) + # set our fill colour instead of line colour
theme(text = element_text(size = 20)) + # set text size
guides(fill = guide_legend(title="Public Health Unit")) +
xlab("Date") + # Set the x-axis label
ylab("New cases") + # Set the y-axis label
ggtitle("New cases per month across all Ontario Public Health Units") +
# Set up our barplot here
geom_bar(stat = "identity") +
scale_fill_viridis_d() # the "d" stands for discrete colour scale
Now that we have taken an initial look at our data, we can see that even after converting our axis to a month-year format, it appears that some of the data isn't that relevant for us. Some of the PHUs are not generating many new cases per day so we can now consider slicing our data up to look at specific regions.
Let's look at the top 10 regions by total caseload across the dataset.
# What are the top 10 regions by total caseload?
covid_phu_long.df %>%
# group the data by public health unit
group_by(public_health_unit) %>%
# Summarize it by the total number of new cases in each PHU
summarise(total_cases = sum(new_cases)) %>%
# Sort all of the data in descending order by total cases
arrange(desc(total_cases)) %>%
# take the top 10 PHUs
.[1:10, ]
`summarise()` ungrouping output (override with `.groups` argument)
| public_health_unit | total_cases |
|---|---|
| <fct> | <dbl> |
| Toronto | 98043 |
| Peel | 63364 |
| York Region | 29864 |
| Ottawa | 15352 |
| Windsor-Essex County | 13292 |
| Durham Region | 12277 |
| Region of Waterloo | 11154 |
| Hamilton | 11084 |
| Halton Region | 9595 |
| Niagara Region | 8849 |
# Generate a list of all PHUs and sort by total caseload
phu_by_total_cases_desc <- covid_phu_long.df %>%
# Group by public health unit
group_by(public_health_unit) %>%
# Based on public health unit, sum the total cases
summarise(total_cases = sum(new_cases)) %>%
# Sort by descending order
arrange(desc(total_cases)) %>%
# Grab the PHU names and convert them into a character vector
select(public_health_unit) %>%
unlist() %>%
as.character() # Coercion to a vector removes the names. unname() works as well.
# Take a look at the public health units
print(phu_by_total_cases_desc)
`summarise()` ungrouping output (override with `.groups` argument)
 [1] "Toronto" "Peel" "York Region" "Ottawa"
 [5] "Windsor-Essex County" "Durham Region" "Region of Waterloo" "Hamilton"
 [9] "Halton Region" "Niagara Region" "Simcoe Muskoka District" "Middlesex-London"
[13] "Wellington-Dufferin-Guelph" "Eastern Ontario" "Southwestern" "Lambton"
[17] "Thunder Bay District" "Brant County" "Haldimand-Norfolk" "Chatham-Kent"
[21] "Huron Perth District" "Haliburton, Kawartha, Pine Ridge District"
[23] "Leeds, Grenville and Lanark District" "Sudbury & District"
[25] "Peterborough" "Kingston, Frontenac and Lennox & Addington"
[27] "Grey Bruce" "Northwestern" "Hastings and Prince Edward Counties"
[30] "Renfrew County and District" "Porcupine" "North Bay Parry Sound District"
[33] "Algoma" "Timiskaming"
Using the filter() command to make a subset of our data

Now that we have a list of PHUs ordered by descending total cases, we can use it to filter our covid_phu_long.df dataframe and graph only the more heavily infected PHUs. We can then pipe the filtered data over to make a ggplot() object. At the same time we'll make a few more adjustments:
# Make a bar graph
covid_phu_long.df %>%
# Filter our data based on the PHUs we want to see
filter(public_health_unit %in% phu_by_total_cases_desc[1:3]) %>%
# Redirect our new data frame to ggplot
ggplot(.) +
# set our fill colour by reordering the levels of the data supplied
aes(x = as.yearmon(date), y= new_cases, fill = fct_reorder(public_health_unit, new_cases)) +
theme(text = element_text(size = 20)) + # set text size
guides(fill = guide_legend(title="Public Health Unit")) +
xlab("Date") + # Set the x-axis label
ylab("New cases") + # Set the y-axis label
ggtitle("New cases per month across top 3 Ontario Public Health Units") +
# Set up our barplot here
geom_bar(stat = "identity") +
scale_fill_viridis_d() # the "d" stands for discrete colour scale
We can see from our first graph of daily caseloads that there can be quite a bit of variability from day to day. Rather than look at the daily tally of new cases, perhaps we can take into account the overall number of new cases appearing in a 14-day sliding window. Given that symptoms can take between 5 and 14 days to manifest after infection, a portion of each day's positive cases can be the result of infections going back as far as 14 days. A 14-day window will also smooth out our data when plotted as a line graph.
To accomplish this we'll need to perform some transformations on our dataset.
We'll want to track observations by:
# Shut down some output information from the summarise function
options(dplyr.summarise.inform = FALSE)
# 1. group our data by public health unit
covid_phu_long.df <- covid_phu_long.df %>% group_by(public_health_unit)
# 2. get a complete list of case dates
case.dates <- unique(covid_phu_long.df$date)
# 3. set up a table to hold our summarised results
phu_window_data.df <- data.frame(public_health_unit = character(0),
                                 window_mean = numeric(0),
                                 start_date = numeric(0),
                                 end_date = numeric(0))
# A 14-day window covers the start date plus the 13 days that follow
case_window <- 14 - 1
# Iterate through the dates in a 14-day sliding window
for (i in 1:(length(case.dates) - case_window)) {
curr.set <- covid_phu_long.df %>%
# Filter for a set of data that spans 14 days
filter(date %in% case.dates[i:(i+case_window)]) %>%
# Summarize that data based on public health unit
summarize(window_mean = mean(new_cases))
# Track the start and end dates of the window
curr.set$start_date = case.dates[i]
curr.set$end_date = case.dates[i + case_window]
# Add this table to the collected data
phu_window_data.df <- rbind(phu_window_data.df, curr.set)
}
# Check on the final structure of the data
str(phu_window_data.df)
tibble [11,594 x 4] (S3: tbl_df/tbl/data.frame)
 $ public_health_unit: Factor w/ 34 levels "Algoma","Brant County",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ window_mean       : num [1:11594] 0.5 3.571 0.714 15.357 2.786 ...
 $ start_date        : Date[1:11594], format: "2020-03-24" "2020-03-24" ...
 $ end_date          : Date[1:11594], format: "2020-04-06" "2020-04-06" ...
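As an aside, growing a data frame with rbind() inside a loop (as above) gets slow as the data grows. The same 14-day trailing mean can be computed in a vectorized way; base R's stats::filter() does it in one call, and zoo::rollmean() (zoo is already loaded) offers the equivalent with align = "right". A minimal sketch on a made-up case-count vector, not the course dataset:

```r
# 14-day trailing mean of daily case counts, vectorized with base R.
# `new_cases` here is a toy vector (1..20), not the course data.
new_cases <- 1:20

# sides = 1 makes the window trailing (the current day plus the 13 before it);
# the first 13 entries are NA because a full window isn't available yet.
window_mean <- as.numeric(stats::filter(new_cases, rep(1 / 14, 14), sides = 1))

window_mean[14]  # mean(1:14) = 7.5
window_mean[20]  # mean(7:20) = 13.5
```

To replace the whole loop, the same idea can be applied inside a grouped mutate(), e.g. mutate(window_mean = zoo::rollmean(new_cases, 14, fill = NA, align = "right")); treat that exact call as a sketch to adapt rather than the official solution.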
Now that we've generated our windowed data, let's plot the top 5 PHUs by caseload. Let's also annotate some dates from the 2020 pandemic history:
# Build our plot and save to an object
phu_window.plot <- phu_window_data.df %>%
# Filter for the top 5 infected PHUs
filter(public_health_unit %in% phu_by_total_cases_desc[1:5]) %>%
# redirect the filtered result to ggplot
ggplot() +
aes(x = ..., y = ..., colour = fct_reorder(public_health_unit, ..., .desc=TRUE)) +
geom_line(size=2) +
scale_color_viridis_d() +
theme_bw() + # Simplify the theme
xlab("Date") +
ylab("Mean cases in 14-day window") +
ggtitle("Mean cases in a 14-day window across top 5 Ontario Public Health Units") +
theme(text = element_text(size = 20)) + # set text size
guides(colour = guide_legend(title="Public Health Unit")) + # set our legend name
theme(panel.grid.major.y = element_line(color="grey95")) + # darken our major y grid
theme(panel.grid.minor.y = element_blank()) + # remove our minor y grid
theme(panel.grid.minor.x = element_blank()) + # remove our minor x grid
# Start looking at data from July 2020 onwards
scale_x_date(...,
date_breaks = ..., date_labels = ...) +
# Annotate windows of various milestones
geom_text(aes(x=as.Date("2020-07-31") +7 , label = "Toronto enters stage 3", y=500), angle=90, size=10, colour="black") +
annotate("rect", xmin=as.Date("2020-07-31"), xmax=as.Date("2020-07-31")+14, ymin=-Inf, ymax=Inf, fill="grey", alpha=0.2) +
geom_text(aes(x=as.Date("2020-09-15") + 7, label = "School starts", y=500), angle=90, size=10, colour="black") +
annotate("rect", xmin=as.Date("2020-09-15"), xmax=as.Date("2020-09-15")+14, ymin=-Inf, ymax=Inf, fill="orange", alpha=0.2) +
geom_text(aes(x=as.Date("2020-12-26") + 7, label = "Province-wide lockdown", y=500), angle=90, size=10, colour="black") +
annotate("rect", xmin=as.Date("2020-12-26"), xmax=as.Date("2020-12-26")+14, ymin=-Inf, ymax=Inf, fill="red", alpha=0.2)
# plot our object to standard output
phu_window.plot
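If you're completing the skeleton above on your own, one plausible way to fill the elided (...) pieces is sketched below. These are guesses based on the columns we built (start_date, window_mean) and the July-2020 framing in the comments, not the only valid choices:

```r
# Hypothetical fill-ins for the ... placeholders (one option among many):
aes(x = start_date, y = window_mean,
    colour = fct_reorder(public_health_unit, window_mean, .desc = TRUE)) +
# ...and for the x-axis:
scale_x_date(limits = c(as.Date("2020-07-01"), NA),
             date_breaks = "1 month", date_labels = "%b %Y") +
```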
That's our first class! If we've made it this far, we've reviewed tidy data practices and the basics of ggplot2.

We took a "messy" dataset from the Ontario government and created a tidy dataset that we were able to graph from. We took that further by transforming the data into a 14-day sliding window of mean new cases per day in each public health unit. This clarified our picture of cases and visually confirmed that the spread of SARS-CoV-2 does appear to be mitigated by lockdown orders.
Next week? Getting deeper into ggplot2!
This week's assignment will be found under the current lecture folder under the "assignment" subfolder. It will include a Jupyter notebook that you will use to produce the code and answers for this week's assignment. Please provide answers in markdown or code cells that immediately follow each question section.
| Assignment breakdown | Weight | Criteria |
|---|---|---|
| Code | 50% | - Does it follow best practices? |
| | | - Does it make good use of available packages? |
| | | - Was data prepared properly? |
| Answers and Output | 50% | - Is output based on the correct dataset? |
| | | - Are groupings appropriate? |
| | | - Are titles/axes/legends correct? |
| | | - Is interpretation of the graphs correct? |
Since coding styles and solutions can differ, students are encouraged to use best practices. Well-coded or elegant solutions may be rewarded.
You can save and download the Jupyter notebook in its native format. Submit this file to the appropriate assignment section by 12pm on the date of our next class: March 25th, 2021.
lubridate package: https://r4ds.had.co.nz/dates-and-times.html

For this introductory course we will be teaching and running code for R through Jupyter notebooks. In this section we will discuss
As of 2021-01-18, the latest version of Anaconda3 runs with Python 3.8
Download the OS-appropriate version from here https://www.anaconda.com/products/individual
All versions should come with Python 3.8
Windows:
MacOS:
Unix:
As of 2020-12-11, the latest version of r-base available for Anaconda is 4.0.3, but Anaconda comes pre-installed with R 3.6.1. To save time, we will update just our r-base version through the command line using the Anaconda prompt. You'll need to find the menu shortcut to the prompt in order to run these commands. Before class, you should update all of your Anaconda packages; this will ensure you get the latest version of Jupyter notebook. Open up the Anaconda prompt and type the following command:
conda update --all
It will ask for permission to continue at some point; say 'yes'. After this is completed, use the following command:
conda install -c conda-forge/label/main r-base=4.0.3=hddad469_3
Anaconda will try to install a number of R-related packages. Say 'yes' to this.
Lastly, we want to connect your R version to the Jupyter notebook itself. Type the following command:
conda install -c r r-irkernel
Jupyter should now have R integrated into it. No need to build an extra environment to run it.
You may find that, for some reason or another, you'd like to maintain a specific R environment (or any other kind) to work in. Environments in Anaconda work like isolated sandbox versions of Anaconda within Anaconda. When you generate an environment for the first time, it will draw all of its packages and information from the base version of Anaconda, kind of like making a copy. You can also create these in the Anaconda prompt, and you can even create new environments based on specific versions or installations of other programs. For instance, we could have made an environment for R 4.0.3 with the command
conda create -n my_R_env -c conda-forge/label/main r-base=4.0.3=hddad469_3
This would create a new environment with version 4.0.3 of R but the base version of Anaconda would retain version 3.6.1 of R. A small but helpful detail if you are unsure about newer versions of packages that you'd like to use.
Likewise, you can update and install packages in new environments without affecting or altering your base environment! Again it's helpful if you're upgrading or installing new packages and programs. If you're not sure how it will affect what you already have in place, you can just install them straight into an environment.
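Putting the pieces together, a typical environment workflow in the Anaconda prompt looks like this (the environment name my_R_env is just our example from above; the conda subcommands themselves are standard):

```shell
# Switch into the environment created earlier
conda activate my_R_env

# Anything installed now lands in my_R_env, leaving base untouched
conda install -c r r-irkernel

# Return to the base environment when finished
conda deactivate
```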
For more information: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#cloning-an-environment
If you are inclined, the Anaconda Navigator can help you make an R environment separate from the base, but you won't be able to perform the same fancy tricks as in the prompt, like installing new packages directly to a new environment.
Note: You should consider doing this only if you have a good reason to isolate what you're doing in R from the Anaconda base packages. You will also need to have installed r-base 4.0.3 to make a new environment with it through the Anaconda navigator.
The Anaconda navigator is a graphical interface that shows all of your pre-installed packages and gives you access to installing other common programs like RStudio (we'll get to that in a moment).
You will now have an R environment where you can install specific R packages that won't make their way into your Anaconda base.
You will likely find a shortcut to this environment in your (Windows) menu under the Anaconda folder. It will look something like Jupyter Notebook (R-4-0-3)
Normally I suggest avoiding installing packages through your Jupyter Notebook. Instead, if you want to update your R packages for running Jupyter, it's best to add them through either the Anaconda prompt or Anaconda navigator. Again, using the prompt gives you more options but can seem a little more complicated.
One of the most useful packages to install for R is r-essentials. Open up the Anaconda prompt and use the command:
conda install -c r r-essentials
After running, the Anaconda prompt will inform you of any package dependencies and it will identify which packages will be updated, newly installed, or removed (unlikely).
Anaconda has multiple channels (similar to repositories) that are maintained by different groups. These channels port regular R packages over to a format that can be installed in Anaconda and run by R. The two main channels you'll find useful are the r channel and the conda-forge channel. You can find more information about all of the packages on docs.anaconda.com. As you might have guessed, the basic format for installing packages is: conda install -c channel-name r-package, where
conda install is the call to install packages. This can be done in a base or custom environment
-c channel-name identifies that you wish to name a specific channel to install from
r-package is the name of your package; most package names begin with r-, e.g. r-ggplot2
As of 2020-12-11, the latest stable R version is 4.0.3:
Windows:
- Go to <http://cran.utstat.utoronto.ca/>
- Click on 'Download R for Windows'
- Click on 'install R for the first time'
- Click on 'Download R 4.0.3 for Windows' (or a newer version)
- Double-click on the .exe file once it has downloaded and follow the instructions.
(Mac) OS X:
- Go to <http://cran.utstat.utoronto.ca/>
- Click on 'Download R for (Mac) OS X'
- Click on R-4.0.3.pkg (or a newer version)
- Open the .pkg file once it has downloaded and follow the instructions.
Linux:
- Open a terminal (Ctrl + alt + t)
- sudo apt-get update
- sudo apt-get install r-base
- sudo apt-get install r-base-dev (so you can compile packages from source)
As of 2021-01-18, the latest RStudio version is 1.4.1103
Windows:
- Go to <https://www.rstudio.com/products/rstudio/download/#download>
- Click on 'RStudio 1.3.1093 - Windows Vista/7/8/10' to download the installer (or a newer version)
- Double-click on the .exe file once it has downloaded and follow the instructions.
(Mac) OS X:
- Go to <https://www.rstudio.com/products/rstudio/download/#download>
- Click on 'RStudio 1.3.1093 - Mac OS X 10.13+ (64-bit)' to download the installer (or a newer version)
- Double-click on the .dmg file once it has downloaded and follow the instructions.
Linux:
- Go to <https://www.rstudio.com/products/rstudio/download/#download>
- Click on the installer that describes your Linux distribution, e.g. 'RStudio 1.3.1093 - Ubuntu 18/Debian 10(64-bit)' (or a newer version)
- Double-click on the .deb file once it has downloaded and follow the instructions.
- If double-clicking on your .deb file did not open the software manager, open the terminal (Ctrl + alt + t) and type **sudo dpkg -i /path/to/installer/rstudio-xenial-1.3.959-amd64.deb**
_Note: You have 3 things that could change in this last command._
1. This assumes you have just opened the terminal and are in your home directory. (If not, you have to modify your path. You can get to your home directory by typing cd ~.)
2. This assumes you have downloaded the .deb file to Downloads. (If you downloaded the file somewhere else, you have to change the path to the file, or download the .deb file to Downloads).
3. This assumes your file name for .deb is the same as above. (Put the name matching the .deb file you downloaded).
If you have a problem with installing R or RStudio, you can also try to solve the problem yourself by Googling any error messages you get. You can also try to get in touch with me or the course TAs.
RStudio is an IDE (Integrated Development Environment) for R that provides a more user-friendly experience than using R in a terminal setting. It has 4 main areas or panes, which you can customize to some extent under Tools > Global Options > Pane Layout:
All of the panes can be minimized or maximized using the large and small box outlines in the top right of each pane.
The Source is where you keep the code and annotation that you want saved as your script. The tab at the top left of the pane has your script name (i.e. 'Untitled.R'), and you can switch between scripts by toggling the tabs. You can save, search or publish your source code using the buttons along the pane header. Code in the Source pane is not run automatically; you have to execute it yourself.
To run your current line of code or a highlighted segment of code from the Source pane you can:
a) click the button 'Run' -> 'Run Selected Line(s)',
b) click 'Code' -> 'Run Selected Line(s)' from the menu bar,
c) use the keyboard shortcut CTRL + ENTER (Windows & Linux) or Command + ENTER (Mac) (recommended), or
d) copy and paste your code into the Console and hit Enter (not recommended).
There are always many ways to do things in R, but the fastest way will always be the option that keeps your hands on the keyboard.
You can also type and execute your code (by hitting ENTER) in the Console when the > prompt is visible. If you enter code and you see a + instead of a prompt, R doesn't think you are finished entering code (i.e. you might be missing a bracket). If this isn't immediately fixable, you can hit Esc twice to get back to your prompt. Using the up and down arrow keys, you can find previous commands in the Console if you want to rerun code or fix an error resulting from a typo.
On the Console tab in the top left of that pane is your current working directory. Pressing the arrow next to your working directory will open your current folder in the Files pane. If you find your Console is getting too cluttered, selecting the broom icon in that pane will clear it for you. The Console also shows information about R on startup (such as the version number), during the installation of packages, when there are warnings, and when there are errors.
In the Global Environment you can see all of the stored objects you have created or sourced (imported from another script). The Global Environment can become cluttered, so it also has a broom button to clear its workspace.
Objects are made by using the assignment operator <-. On the left side of the arrow, you have the name of your object. On the right side you have what you are assigning to that object. In this sense, you can think of an object as a container. The container holds the values given as well as information about 'class' and 'methods' (which we will come back to).
Type x <- c(2,4) in the Console followed by Enter. A 1D object's data type and first few values can be seen immediately in the Environment pane. Now type y <- data.frame(numbers = c(1,2,3), letters = c("a","b","c")) in the Console followed by Enter. You can immediately see the dimensions of 2D objects, and you can check the structure of data frames and lists (more later) by clicking on the object's arrow. Clicking on the object name will open the object to view in a new tab. Custom functions created in session or sourced will also appear in this pane.
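The same 'class' information the Environment pane summarizes can also be queried directly from code; a quick standalone sketch using the toy objects above:

```r
# The same toy objects, inspected from code rather than the Environment pane
x <- c(2, 4)
y <- data.frame(numbers = c(1, 2, 3), letters = c("a", "b", "c"))

class(x)  # "numeric"     - the vector's data type
class(y)  # "data.frame"  - the 2D object's class
dim(y)    # 3 2           - rows and columns, as shown in the pane
str(y)    # the same compact structure view the pane's arrow expands to
```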
The Environment pane dropdown displays all of the currently loaded packages in addition to the Global Environment. Loaded means that all of the tools/functions in the package are available for use. R comes with a number of packages pre-loaded (i.e. base, grDevices).
In the History tab are all of the commands you have executed in the Console during your session. You can select a line of code and send it to the Source or Console.
The Connections tab is to connect to data sources such as Spark and will not be used in this lesson.
The Files tab allows you to search through directories; you can go to or set your working directory by making the appropriate selection under the More (blue gear) drop-down menu. The ... to the top left of the pane allows you to search for a folder in a more traditional manner.
The Plots tab is where plots you make in a .R script will appear (notebooks and markdown plots will be shown in the Source pane). There is the option to Export and save these plots manually.
The Packages tab has all of the packages that are installed and their versions, and buttons to Install or Update packages. A check mark in the box next to a package means that the package is loaded. You can load a package by adding a check mark next to it; however, it is good practice to instead load the package in your script to aid in reproducibility.
The Help menu has the documentation for all packages and functions. For each function you will find a description of what the function does, the arguments it takes, what the function does to the inputs (details), what it outputs, and an example. Some of the help documentation is difficult to read or less than comprehensive, in which case googling the function is a good idea.
The Viewer will display vignettes, or local web content such as a Shiny app, interactive graphs, or a rendered html document.
I suggest you take a look at Tools -> Global Options to customize your experience.
For example, under Code -> Editing I have selected Soft-wrap R source files followed by Apply so that my text will wrap by itself when I am typing and not create a long line of text.
You may also want to change the Appearance of your code. I like the RStudio theme: Modern and Editor font: Ubuntu Mono, but pick whatever you like! Again, you need to hit Apply to make changes.
That whirlwind tour isn't everything the IDE can do, but it is enough to get started.